Unstructured data editing through category comparison

ABSTRACT

Embodiments of the present invention include methods for editing and scanning unstructured data and text by using one or more external categories of data for the purpose of finding words and phrases in the unstructured environment which correspond to words and phrases in the external category. An external category of data is a list of words and phrases that relate to the topic of the category. External categories can be made for practically any subject. When a match (“hit”) is found, an output record is written to a table or a file. The output record may include the document name, the word that was a hit, and the external category. External categories may be applied to the unstructured data either directly or indirectly.

CROSS-REFERENCE TO RELATED APPLICATIONS

This invention claims the benefit of priority from U.S. Provisional Application No. 60/729,830, filed Oct. 25, 2005, entitled “Unstructured Data Editing Through Category Comparison.”

BACKGROUND

The present invention relates to processing unstructured and structured data, and in particular, to unstructured data editing through category comparison.

Unstructured data typically comes in the form of email, transcribed telephone conversations, spreadsheets, documents, letters, and other forms. Individuals and corporations have used unstructured data for a long time. As the name suggests, there is no structure to unstructured data. There are no rules for writing emails. There are no rules for having a telephone conversation. Instead, with unstructured data everything is free form.

In contrast to unstructured data, structured data is data that is formatted into records, tables, and attributes. Typical computerized operating systems and database management systems operate on structured data. Structured records are typically placed in a file. Once in a file or a database, the records can be accessed and used for a variety of purposes. With structured data there is a regularity of the contents of the data. The same type of data appears and reappears in the different records. Structured data is ideal for computerized transaction processing, where bank transactions, airline reservations, insurance claims, manufacturing assembly work and so forth are executed.

For years organizations have had both kinds of data in their environment: unstructured data and structured data. These different environments have grown up beside each other, but there has been very little interaction between them. It is as if the two environments operated in complete isolation from each other. There is, however, great value in being able to merge and intertwine these two environments. Many different business opportunities emerge that would not have been possible had the two environments remained separate. As one simple example of the opportunities that arise when the two worlds are merged together, consider customer relationship management (CRM). In customer relationship management the organization attempts to form a close relationship with its customers and its prospects. The organization collects demographic data about the customer. But when communications (emails, telephone conversations, other documents) are added to the mix, the ability to get to know the customer is greatly enhanced. And emails, telephone conversations, and documents are all forms of unstructured information. Therefore, for organizations that want to engage in CRM, adding unstructured data to the structured CRM environment enables entirely new and powerful types of processing. There are many other important examples of applications that become possible when the gap between structured data and unstructured data is bridged. Other applications include monitoring of compliance, such as compliance with Sarbanes Oxley, HIPAA and Basel II, the enforcement of standards, and so forth.

There are many problems associated with merging structured data and unstructured data. One of the major problems is the internal organization of the data itself. In a word, structured data is highly controlled and disciplined. There is strict control over structured data. But there is little or no control or discipline for unstructured data. The result is that when the two types of data are merged, there is a colossal mismatch. Simply merging structured data and unstructured data together does not produce anything meaningful. In order to have any meaningful merger of structured and unstructured data, it is necessary to carefully manipulate the unstructured data (e.g., text) so that the unstructured data can be placed in a form and format that is compatible with and useful to structured data.

One of the many problems of preparing unstructured data for merger with structured data is that of determining what words and phrases in the unstructured text are relevant and useful to business problems. This is especially important in light of the many different meanings of the same word or phrase in the English language. For example, the word “book” can mean very different things. The meaning of “I read a book on the airplane trip.” is quite different from “I was booked into jail last night.” The English language is full of such homographs. What is needed is a way to resolve the different meanings of words and to relate those words to business problems and issues.

Thus, there is a need for an improved bridge between unstructured and structured data. The present invention solves these and other problems by providing unstructured data editing through category comparison.

SUMMARY

Embodiments of the present invention include techniques for unstructured data editing through category comparison. In one embodiment, the present invention includes a method of processing unstructured data comprising specifying a first plurality of words or phrases corresponding to a category, accessing unstructured data comprising a second plurality of words or phrases, comparing the unstructured data against each of the specified words or phrases, associating at least a portion of the unstructured data with the category if one or more of the specified words or phrases matches at least one word or phrase in the portion of the unstructured data, and generating a structured data output.

In one embodiment, the structured data output comprises an identification of an unstructured document, a matching word or phrase, and a name of the category.

In one embodiment, the structured data output comprises at least a portion of the unstructured data, at least one matching word or phrase in the unstructured data and the category.

In one embodiment, the structured data output is a structured record.

In one embodiment, the structured data output is generated in a list.

In one embodiment, the structured data output is generated in a database.

In one embodiment, the structured data output is generated in a table.

In one embodiment, the method further comprises reading the unstructured data into a file, and accessing the unstructured data from the file.

In one embodiment, the method further comprises reading the unstructured data directly from the unstructured data source.

In one embodiment, the unstructured data comprises a plurality of emails.

In one embodiment, the unstructured data comprises a plurality of spreadsheets.

In one embodiment, the unstructured data comprises a plurality of transcribed telephone conversations.

In one embodiment, the unstructured data comprises one or more electronic files comprising a plurality of words or phrases.

In one embodiment, the unstructured data comprises textual data.

In one embodiment, the category comprises accounting.

In one embodiment, the category comprises finance.

In one embodiment, the category comprises sales.

In one embodiment, the category comprises Sarbanes Oxley.

In one embodiment, the category comprises manufacturing.

In one embodiment, the category comprises marketing.

In one embodiment, the category comprises human resources.

In one embodiment, the category is generated from the unstructured data.

In one embodiment, the category is an external category.

In one embodiment, the category comprises a name and a plurality of associated words or phrases.

The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the structured and the unstructured environments.

FIG. 2 illustrates the bridge that is needed in order to cross the gap between the two environments.

FIG. 3 illustrates text gathered from a wide variety of unstructured sources.

FIG. 4 illustrates two categories formed from the text found in the unstructured environment.

FIG. 5 illustrates an external category.

FIG. 6 illustrates that external categories can come from anywhere.

FIG. 7 illustrates example external categories.

FIG. 8 illustrates direct and indirect techniques for the usage of and execution against an external category.

FIG. 9 shows the dynamics of a direct external category search.

FIG. 10 shows the dynamics of an indirect external category search.

FIG. 11 shows that multiple external categories may be used during an unstructured data search.

FIG. 12 shows that the same word may appear in more than one external category.

FIG. 13 shows that external categorization processing can occur in conjunction with other unstructured editing.

FIG. 14 shows the content of the output from the external data matching process.

DETAILED DESCRIPTION

Described herein are systems and methods for bridging data between an unstructured and structured environment. In one embodiment, the present invention includes using external categories for the purpose of understanding what is inside unstructured text. In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present invention. It will be evident, however, to one skilled in the art that the present invention as defined by the claims may include some or all of the features in these examples alone or in combination with other features described below, and may further include obvious modifications and equivalents of the features and concepts described herein.

Embodiments of the present invention include unstructured bridging software that may be used to capture, organize, store, and display unstructured data and prepare that unstructured data for the purpose of integrating it with and sending it to the structured environment. The software for this purpose is called the “foundation” or the “editor.” In particular, the foundation can access many forms of unstructured data, including spreadsheets, transcribed telephone conversations, documents, emails, and many other forms of textual unstructured information. In one embodiment, at the point of accessing unstructured data, a lookup may be performed against words and phrases in external or internal categories of data. For example, one or more words or phrases corresponding to a particular category may be specified. If the foundation software finds a match between a word or phrase in unstructured data and a specified word or phrase, the word that has been matched, the document id, and the external category name, for example, may be written out to a simple list or database. The match is called a “hit.” The output table is then available for processing in the structured environment.

Embodiments of the present invention include methods of scanning and editing unstructured data for the purpose of comparing the unstructured data against words and phrases found in the external categories which have been constructed by the organization. The invention may include several components: one or more external categories (e.g., a list of words and phrases which are relevant to or important to the topic of the external category), a body of unstructured text, an editor program which does the comparisons, and an output list of the “hits,” for example.

Once unstructured text is ready for processing, the unstructured text is examined one word or phrase at a time to determine if there is a match with any of the words and phrases found in the external categories. If a match is found, the word that has been matched, its source document, and its external category may be written to the output table or database. In one embodiment, the present invention uses the technique of external categorization matching against unstructured data.

Two kinds of categorizations of text can be created: an internal categorization and an external categorization. The first kind, internal categorization, is created by looking only at the words found in the unstructured environment. In an internal categorization the words inside the unstructured environment are taken and manipulated to create the major “themes” or categories of data. Internal categorizations differ from external categorizations. An external categorization of data is created externally to the text or data found inside the unstructured text. The external data can come from anywhere. Indeed there may be no match between any words or phrases found in the external categorization and the unstructured data or text. There may also be a significant intersection between the two environments.
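
By way of a non-limiting illustration only, an internal categorization might be sketched by counting the words that occur most often in the unstructured text itself. The description does not specify how the words are manipulated to form themes; the frequency-count approach, the stop-word list, and the name internal_themes are assumptions introduced here solely for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption); a real one would be larger.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "was", "for", "on"}

def internal_themes(documents, top_n=3):
    """Return the top_n most frequent non-stop words across all documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

docs = [
    "Revenue was booked late. Revenue is under audit.",
    "The audit of revenue found a delayed payment.",
]
themes = internal_themes(docs, top_n=2)
print(themes)
```

In this sketch the candidate themes come entirely from the documents being analyzed, which is what distinguishes an internal categorization from an external one.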

The technique of external category processing against unstructured data for the purpose of understanding the unstructured data begins with an external category. An external category has a name such as Sarbanes Oxley, accounting, human resources, etc. The name reflects the general orientation of the words that will be found in the category. The external category contains a list of words and phrases. The words and phrases are all essential and/or important language relevant to the external category. For example, the external category for Sarbanes Oxley might have the words and phrases “promise to deliver”, “contingent sale”, “delayed payment”, “unrecognized revenue”, and so forth. Or the external category for human resources might have the words and phrases “race”, “background”, “education”, “GPA”, “college degree”, and so forth. The purpose of placing words and phrases into an external category is to identify words and phrases, important to a topic, that are in the unstructured document that is being searched or otherwise analyzed. In other words, when the word “revenue” is placed in the external category for accounting, and the word “revenue” is found in the unstructured document, it is recognized that the text of the unstructured document is relevant to accounting. A “hit” refers to a match between a word or phrase in the external category and a word or phrase in the unstructured document. Upon finding a “hit” on the word “revenue”, an entry is created in a separate table. The data found in the separate table may include the name of the source document, the word that has been matched (or “hit”), and the external category, for example.

As an example, suppose the word “revenue” is found in an external category for accounting. Suppose an unstructured document known as ABCDE123 is being analyzed. The resulting hit would produce a record in a list or a database where the entry would look as follows: “doc name: ABCDE123; matched word: revenue; external category: accounting.”
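
As a minimal sketch of such an entry, the three fields of the example above can be held in a small record type. The class name HitRecord and its field names are hypothetical and are chosen only to mirror the example.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class HitRecord:
    """One output entry: source document, matched word, external category."""
    doc_name: str
    matched_word: str
    external_category: str

# The example from the text: "revenue" found in document ABCDE123.
record = HitRecord("ABCDE123", "revenue", "accounting")
row = asdict(record)  # plain dict, ready to be written to a list or table
print(row)
```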

Note that the same word may appear in multiple external categories. For example the word “revenue” may appear in the external categories of accounting, finance, sales, Sarbanes Oxley, and so forth. External categories can come from anywhere. There are no limitations or boundaries for the source of data found in any external data category.
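
One possible way to handle a word that belongs to several external categories, offered only as an illustrative sketch, is to build a lookup from each word or phrase to every category that lists it. The function name build_index and the sample words other than “revenue” and “contingent sale” are assumptions.

```python
from collections import defaultdict

# Sample categories; "ledger" and "interest" are made-up illustrative entries.
categories = {
    "accounting": ["revenue", "ledger"],
    "finance": ["revenue", "interest"],
    "sales": ["revenue", "contingent sale"],
}

def build_index(categories):
    """Map each word or phrase to the list of categories that contain it."""
    index = defaultdict(list)
    for name, terms in categories.items():
        for term in terms:
            index[term.lower()].append(name)
    return dict(index)

index = build_index(categories)
print(index["revenue"])
```

With such a lookup, a single occurrence of “revenue” in a document can yield one hit per category in which the word appears.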

The output of the “hits” or matches may be sent to a table or a list. The table can be in the form of a simple list. The table can be in a database, for example. The structure of the database may be very similar to a relational flat file. Once the simple list or database is created, the data is then available for processing in the structured environment.
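
For illustration only, the hit table might be kept in a relational database so that it can be queried from the structured environment. The sketch below uses an in-memory SQLite database; the table name hits and its column names are assumptions, not part of the description.

```python
import sqlite3

# In-memory database for the sketch; a file path could be used instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (doc_name TEXT, matched_word TEXT, external_category TEXT)"
)

hits = [
    ("ABCDE123", "revenue", "accounting"),
    ("ABCDE123", "revenue", "finance"),
]
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", hits)
conn.commit()

# Structured-side query: which documents and categories mention "revenue"?
rows = conn.execute(
    "SELECT doc_name, external_category FROM hits "
    "WHERE matched_word = ? ORDER BY external_category",
    ("revenue",),
).fetchall()
print(rows)
```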

The simple output table tells the viewer where in the unstructured world there is data that relates to the different external categories. The editing pass of the unstructured data can use multiple external categories of data. There is no theoretical limit as to how many external categories can be used (e.g., all at the same time) in editing and scanning the unstructured data.

In another embodiment, the external categories of data can be in different languages. One external category can be in French, another external category can be in English, and another external category can be in Spanish. There is no limitation on the different languages that can be mixed together.
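
As a hedged sketch of mixing languages, the same comparison can be run against categories whose word lists are in English, French, and Spanish. The category names, the sample words, and the function find_hits are assumptions made for illustration; the word lists are assumed to be stored in lowercase.

```python
# Hypothetical categories in three languages; all terms stored in lowercase.
categories = {
    "accounting (English)": {"revenue", "invoice"},
    "comptabilité (French)": {"revenu", "facture"},
    "contabilidad (Spanish)": {"ingresos", "factura"},
}

def find_hits(doc_id, text, categories):
    """Compare each word of text against every category, in any language."""
    hits = []
    for word in text.split():
        # Strip surrounding punctuation and fold case before comparing.
        token = word.strip(".,;:!?\"'()").casefold()
        for name, terms in categories.items():
            if token in terms:
                hits.append((doc_id, token, name))
    return hits

hits = find_hits("DOC1", "La facture and the invoice show ingresos.", categories)
print(hits)
```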

FIG. 1 illustrates the two environments: the structured environment 102 and the unstructured environment 101. Features and advantages of the present invention include analyzing unstructured data 101 and converting the unstructured data into a structured format for movement into the structured environment 102 as shown by arrow 103. The structured environment 102 is made up of records, tables, attributes, data elements, and database management systems. The unstructured environment 101 is made up of emails 110, documents 120, spreadsheets 140, telephone conversations, and other forms of textual data (e.g., .txt files 130), for example.

FIG. 2 illustrates a bridge 210 between the two environments. The bridge 210 is quite useful in that applications can be written that incorporate both kinds of data. The bridge is very difficult to build because of the extremely different nature of data in the two environments. Unstructured data 201 simply has no structure. On the other hand, structured data 202 requires structure. Therefore the bridge between the two worlds is much more complex than a mere search engine. Embodiments of the invention include a bridge 210 that reads unstructured data sources and receives one or more categories 230, as described above, for creating structured data from unstructured data.

FIG. 3 shows that the foundation software 310 can read unstructured data from many sources. Text may be gathered from different sources and converted into a structured format. Typical sources are spreadsheets 301, documents 302, emails 303, telephone conversations that have been transcribed 304, or other textual sources (e.g., .txt files 305). In the case of telephone conversations, telephone discussions are usually taped. Then the tapes are transcribed into an electronic textual form. The input seen by the foundation software is the textual form of data. By the time the data arrives at the foundation software, it is just textual data that happens to have originated from different sources.

FIG. 4 shows that the output of foundation processing can be divided into two classes. As illustrated in this example, text may be gathered from many different sources. Once text has been gathered, it can be used to create internal categories 401 of data. Internal data is data and analysis of that data that is generated entirely from the unstructured sources. Alternatively, the data can be associated with an external category. External data is data that relates to one or more external categories of data. There may be no intersection of data between the unstructured text and the external categories, or there may be a considerable intersection. The amount of the intersection depends on what the unstructured data relates to and what external categories are used.

FIG. 5 illustrates an external category 500. An external category may include a category name and words and phrases that relate to the category. In addition, the words and phrases inside the external category can have their own internal structuring within the external category.

FIG. 6 illustrates that external categories of words and phrases can come from anywhere. They can come from different geographies. They can come from different disciplines. They can come from different departments. There simply is no boundary that limits where the sources of external categories can come from.

FIG. 7 illustrates some typical external categories of data. Categories may include accounting, ethics, HIPAA (i.e., a national health care information standard), marketing, human resources, customer companies, Basel II (i.e., an international financial information standard), sales, or Sarbanes-Oxley, for example.

FIG. 8 shows two example ways that foundation editing and processing can be done. One way is to do editing directly at the point of reading the unstructured data. The other way is indirectly, after the unstructured data is “screened” and “filtered.” In either case, external category comparisons can be done in conjunction with other processing against the unstructured data.

FIG. 9 shows the dynamics of a direct comparison of unstructured data to the contents of the external category. In the case shown, the unstructured data is read a word or phrase at a time. The unstructured word that has been read is compared with the words and phrases in the external category. If there is no match, nothing happens. But if there is a match, an output record is written. The output record may include the identification of the document, the word on which there has been a match, and the name of the external category. The process may be repeated for each of the unstructured words. As exemplified in FIG. 9, bridge software 910 receives unstructured data words or phrases. Steps of a direct external category search may begin at 901, where unstructured data is searched sequentially. As shown at 902, upon encountering a word or phrase in the unstructured text, the word or phrase is passed against the words or phrases found in an external category 920. At 903, if a hit is found, the word or phrase, the text id (e.g., identifying the unstructured document), and the category may be placed in a “hit” table or database. At 904, after one unstructured word or phrase is processed, the next unstructured word or phrase is processed, for example.
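
The direct search of steps 901 through 904 can be sketched, by way of example only, as a single sequential pass that tests each word and each short run of consecutive words against the category. The function name direct_search, the tokenizing regular expression, and the max_phrase_len limit are assumptions and are not taken from the description.

```python
import re

def direct_search(doc_id, text, category_name, terms, max_phrase_len=3):
    """Scan text sequentially and return (doc_id, term, category) hits."""
    wanted = {t.lower() for t in terms}
    words = re.findall(r"[a-z0-9']+", text.lower())
    hits = []
    for i in range(len(words)):                     # 901: sequential scan
        for n in range(1, max_phrase_len + 1):      # words, then short phrases
            if i + n > len(words):
                break
            candidate = " ".join(words[i:i + n])
            if candidate in wanted:                 # 902: compare to category
                hits.append((doc_id, candidate, category_name))  # 903: record
    return hits                                     # 904: loop moves to next word

sox_terms = ["promise to deliver", "contingent sale",
             "delayed payment", "unrecognized revenue"]
hits = direct_search(
    "ABCDE123",
    "We made a promise to deliver, but it was a contingent sale.",
    "Sarbanes Oxley",
    sox_terms,
)
print(hits)
```

Because phrases are formed from adjacent tokens after punctuation is dropped, this sketch would also match a phrase that straddles a sentence boundary; a stricter implementation might split on sentences first.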

FIG. 10 shows an indirect usage of the foundation software. In the indirect case the unstructured document is read word by word by software component 1001. The data may be read and sent to a temporary or work file 1002, for example. The unstructured data is edited for other kinds of processing and may then be placed in the work file. The data may then be re-read and processed against the words and phrases found in the external category 1004 of data by software component 1003. When a hit is found an output record 1005 may be written to the output file or database. As exemplified in FIG. 10, the steps of an indirect external category search include sequentially searching unstructured text at 1011. At 1012, a screen may be used for selecting certain words or phrases for further screening, thereby creating a screened list. At 1013, upon encountering a word or phrase in the unstructured text, the word or phrase is passed against the words found in an external category. At 1014, if a hit is found, the word or phrase, the text id, and the category are placed in a “hit” table or database. At 1015, after one unstructured word or phrase is processed, the next unstructured word or phrase from the screened list is processed. It is to be understood that the above two examples showing direct and indirect processing are only examples. Features and embodiments of the present invention may be implemented into systems in a variety of different ways.
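
The indirect case can be sketched in the same spirit, with a first pass that screens words into a work file and a second pass that re-reads the work file and compares it to the category. The description does not define the screen; treating it as a stop-word filter is an assumption, as are the name indirect_search and the one-word-per-line work file layout. For brevity the sketch matches single words only.

```python
import re
import tempfile

# Assumed screen: drop common words before category comparison.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "was", "is"}

def indirect_search(doc_id, text, category_name, terms):
    """Screen words into a work file, then re-read and match them."""
    wanted = {t.lower() for t in terms}
    hits = []
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8") as work_file:
        # First pass (1011-1012): read word by word, screen, write work file.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            if word not in STOP_WORDS:
                work_file.write(word + "\n")
        # Second pass (1013-1015): re-read the screened list and compare.
        work_file.seek(0)
        for line in work_file:
            word = line.strip()
            if word in wanted:
                hits.append((doc_id, word, category_name))
    return hits

hits = indirect_search(
    "ABCDE123",
    "The revenue of the division was delayed.",
    "accounting",
    ["revenue", "ledger"],
)
print(hits)
```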

FIG. 11 shows that multiple external categories of words and phrases 1101-1104 can be used for editing. Editing is not limited to a single external category of data. Thus, there can be one or more external categories used against the unstructured data. The same word may appear in more than one external category.

FIG. 12 shows that the same word or phrase can appear in multiple external categories. In this example, the same word 1201 may appear in category 2 (“eword5”), category 3 (“eword2”), category 4 (“eword1”), and category 1 (“eword4”). The words or phrases may appear in different positions in the different categories, for example.

FIG. 13 shows that editing based on external categorization can be used in conjunction with other editing and manipulation of unstructured data and text. In this example, a first software component 1301 may perform some processing of the unstructured data before bridge component 1302 generates records based on category 1303. Other types of processing may occur before, after, or in parallel with categorization processing, for example.

FIG. 14 shows the output of foundation processing using external categories as a basis for scanning data. In this example, software component 1401 receives unstructured text 1404 and external category 1403. The output is a structured list 1402, which may be a flat file, for example.
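
As one illustrative possibility, the structured list could be written as a comma-separated flat file with one row per hit. The column names, the function write_flat_file, and the second sample document id XYZ789 are hypothetical.

```python
import csv
import io

hits = [
    ("ABCDE123", "revenue", "accounting"),
    ("XYZ789", "GPA", "human resources"),  # made-up second document
]

def write_flat_file(hits, out):
    """Write a header row and one row per hit to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["doc_name", "matched_word", "external_category"])
    writer.writerows(hits)

# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_flat_file(hits, buffer)
flat_file = buffer.getvalue()
print(flat_file)
```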

The above description illustrates various embodiments of the present invention along with examples of how aspects of the present invention may be implemented. The above examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the present invention as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents will be evident to those skilled in the art and may be employed without departing from the spirit and scope of the invention as defined by the claims.

1. A method of processing unstructured data comprising: specifying a first plurality of words or phrases corresponding to a category; accessing unstructured data comprising a second plurality of words or phrases; comparing the unstructured data against each of the specified words or phrases; associating at least a portion of the unstructured data with the category if one or more of the specified words or phrases matches at least one word or phrase in the portion of the unstructured data; and generating a structured data output.
2. The method of claim 1 wherein the structured data output comprises an identification of an unstructured document, a matching word or phrase, and a name of the category.
3. The method of claim 1 wherein the structured data output comprises at least a portion of the unstructured data, at least one matching word or phrase in the unstructured data and the category.
4. The method of claim 1 wherein the structured data output is a structured record.
5. The method of claim 1 wherein the structured data output is generated in a list.
6. The method of claim 1 wherein the structured data output is generated in a database.
7. The method of claim 1 wherein the structured data output is generated in a table.
8. The method of claim 1 further comprising reading the unstructured data into a file, and accessing the unstructured data from the file.
9. The method of claim 1 further comprising reading the unstructured data directly from the unstructured data source.
10. The method of claim 1 wherein the unstructured data comprises a plurality of emails.
11. The method of claim 1 wherein the unstructured data comprises a plurality of spreadsheets.
12. The method of claim 1 wherein the unstructured data comprises a plurality of transcribed telephone conversations.
13. The method of claim 1 wherein the unstructured data comprises one or more electronic files comprising a plurality of words or phrases.
14. The method of claim 1 wherein the unstructured data comprises textual data.
15. The method of claim 1 wherein the category comprises accounting.
16. The method of claim 1 wherein the category comprises finance.
17. The method of claim 1 wherein the category comprises sales.
18. The method of claim 1 wherein the category comprises Sarbanes Oxley.
19. The method of claim 1 wherein the category comprises manufacturing.
20. The method of claim 1 wherein the category comprises marketing.
21. The method of claim 1 wherein the category comprises human resources.
22. The method of claim 1 wherein the category is generated from the unstructured data.
23. The method of claim 1 wherein the category is an external category.
24. The method of claim 1 wherein the category comprises a name and a plurality of associated words or phrases.
25. A method of processing unstructured data comprising: specifying one or more categories, each category comprising a first plurality of words or phrases; reading unstructured data comprising a second plurality of words or phrases; comparing the unstructured data against the words or phrases in each category; associating at least a portion of the unstructured data with at least one category if one or more words or phrases in the at least one category matches at least one word or phrase in the portion of the unstructured data; and generating a structured data output.
26. The method of claim 25 wherein the structured data output comprises an identification of an unstructured document, a matching word or phrase, and a name of the category.
27. The method of claim 25 wherein the structured data output comprises at least a portion of the unstructured data, at least one matching word or phrase in the unstructured data and the category.
28. The method of claim 25 wherein the structured data output is a structured record.
29. The method of claim 25 wherein the structured data output is generated in a list.
30. The method of claim 25 wherein the structured data output is generated in a database.
31. The method of claim 25 wherein the structured data output is generated in a table.
32. The method of claim 25 further comprising reading the unstructured data into a file, and accessing the unstructured data from the file.
33. The method of claim 25 further comprising reading the unstructured data directly from the unstructured data source.
34. The method of claim 25 wherein the unstructured data comprises a plurality of emails.
35. The method of claim 25 wherein the unstructured data comprises a plurality of spreadsheets.
36. The method of claim 25 wherein the unstructured data comprises a plurality of transcribed telephone conversations.
37. The method of claim 25 wherein the unstructured data comprises one or more electronic files comprising a plurality of words or phrases.
38. The method of claim 25 wherein the unstructured data comprises textual data.
39. The method of claim 25 wherein the category comprises accounting.
40. The method of claim 25 wherein the category comprises finance.
41. The method of claim 25 wherein the category comprises sales.
42. The method of claim 25 wherein the category comprises Sarbanes Oxley.
43. The method of claim 25 wherein the category comprises manufacturing.
44. The method of claim 25 wherein the category comprises marketing.
45. The method of claim 25 wherein the category comprises human resources.
46. The method of claim 25 wherein the category is generated from the unstructured data.
47. The method of claim 25 wherein the category is an external category.
48. The method of claim 25 wherein the category comprises a name and a plurality of associated words or phrases.
49. A computer implemented system for processing unstructured data comprising: means for specifying a first plurality of words or phrases corresponding to a category; means for accessing unstructured data comprising a second plurality of words or phrases; means for comparing the unstructured data against each of the specified words or phrases; means for associating at least a portion of the unstructured data with the category if one or more of the specified words or phrases matches at least one word or phrase in the portion of the unstructured data; and means for generating a structured data output.